4. Main Analysis (Exploratory Data Analysis)

The topic of interest is the NYC subway, specifically in Manhattan. There were questions about the NYC subway that we wanted to answer, such as which time period was the busiest, which stations were the busiest, and how human traffic patterns differed between weekdays and weekends. There were also other questions related to the NYC subway that we wanted to explore, e.g., which station had the highest crime rate, whether weather (rainfall) affected subway travel patterns, and the correlation (positive or negative) between taxi and subway usage.

Therefore, we have four main groups of data: NYC subway turnstile counter readings, crime data, weather data and taxi data.

We start by exploring subway turnstile data on its own, before we move on to looking at its relationships with other variables that we are interested in (i.e., crime, weather, taxi).

4.1 Static illustration of turnstile data

library(tidyverse)
# read data
turnstile = read.csv("data/2015_manhattan_turnstile_usage.csv")

4.1.1 Average by day of week

# GroupBy 1.day & 2.interval --> average entry & exit volume
data1 <- turnstile %>%
  select(interval, day, entry_volume, exit_volume) %>%
  group_by(day, interval) %>%
  summarise(avg_entry = mean(entry_volume), avg_exit = mean(exit_volume))
# Reorder by day & interval
data1$day <- factor(data1$day, c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
data1$interval <- factor(data1$interval, c("08PM-12AM","04PM-08PM","12PM-04PM","08AM-12PM","04AM-08AM","12AM-04AM"))
ggplot(data1, aes(y = avg_entry, x = interval)) + 
  geom_col(col='#0072B2', fill="#66CC99")  + ylab("Entry Count") + xlab("Interval") + facet_wrap(~ day) + coord_flip()

ggplot(data1, aes(y = avg_exit, x = interval)) + 
  geom_col(col='#0072B2', fill='#E69F00')  + ylab("Exit Count") + xlab("Interval") + facet_wrap(~ day) + coord_flip()

  • First, we look at how the average number of entries and exits changes over the time of day, faceted by day of week. The variables “day” and “interval” are releveled in chronological order so that we can easily follow the trend over time.
  • The main feature visible in the following plots is that the peak time for entry is “4pm - 8pm” and the peak time for exit is “8am - 12pm”.
  • The insight we draw from this is that people are coming into Manhattan from outside between “8am - 12pm”, the peak time for exits. Likewise, many people are leaving Manhattan between “4pm - 8pm”, the peak time for entries.

4.1.2 The effect of weekday & weekend

  • In the second part of the static analysis of subway data, we explore how the trend of subway usage changes between weekdays and weekends.
  • In 2-1, we look at the general trend of subway entries and exits over a week. Then, in 2-2, we examine the difference between weekdays and weekends.

4.1.2-1 Average by all

# GroupBy interval --> average entry & exit volume
data2_1 <- turnstile %>%
  select(interval, entry_volume, exit_volume) %>%
  group_by(interval) %>%
  summarise(avg_entry = mean(entry_volume), avg_exit = mean(exit_volume))
# Reorder interval
data2_1$interval <- factor(data2_1$interval, c("08PM-12AM","04PM-08PM","12PM-04PM","08AM-12PM","04AM-08AM","12AM-04AM"))
ggplot(data2_1, aes(y = avg_entry, x = interval)) + 
  geom_col(col='#0072B2', fill="#66CC99")  + ylab("Entry Count") + xlab("Interval") + coord_flip()

ggplot(data2_1, aes(y = avg_exit, x = interval)) + 
  geom_col(col='#0072B2', fill='#E69F00')  + ylab("Exit Count") + xlab("Interval")  + coord_flip()

  • To see the subway usage trend over a week, we grouped the data by interval without considering day of week and averaged the entry and exit counts within each time interval. Again, to see how entries and exits change over time, we releveled the variable “interval” in chronological order.
  • As in part 1, the peak time for exits is “8am - 12pm” and the peak time for entries is “4pm - 8pm”, corresponding to people coming into Manhattan for work and going back home after work.

4.1.2-2 Average by weekday vs weekend & holiday

  • We further explore our data to see the difference in subway usage between weekdays and weekends.
  • For the variable “day”, values “Monday” through “Friday” were replaced with “Weekday”, provided the day was not a holiday. If the day was a holiday, a “Saturday”, or a “Sunday”, it was replaced with “Weekend”.
  • We grouped the data by “Weekday” and “Weekend” to calculate the average number of entries and exits per interval.
  • We faceted the graphs by “Weekday” and “Weekend”.
# GroupBy 1.day & 2.holiday flag & 3.interval --> average entry & exit volume
turnstile$is_holiday <- as.character(turnstile$is_holiday)
data2_2 <- turnstile %>%
  select(interval, day, is_holiday, entry_volume, exit_volume) %>%
  group_by(day, is_holiday, interval) %>%
  summarise(avg_entry = mean(entry_volume), avg_exit = mean(exit_volume)) %>%
  ungroup() %>%
  # Collapse "day" into "Weekday"/"Weekend"; holidays count as weekends
  mutate(day = if_else(day %in% c("Saturday", "Sunday") | is_holiday != "False",
                       "Weekend", "Weekday")) %>%
  group_by(day, interval) %>%
  summarise(avg_entry = mean(avg_entry), avg_exit = mean(avg_exit))

data2_2$interval <- factor(data2_2$interval, c("08PM-12AM","04PM-08PM","12PM-04PM","08AM-12PM","04AM-08AM","12AM-04AM"))
ggplot(data2_2, aes(y = avg_entry, x = interval)) + 
  geom_col(col='#0072B2', fill="#66CC99")  + ylab("Entry Count") + xlab("Interval") + facet_wrap(~ day) + coord_flip()

ggplot(data2_2, aes(y = avg_exit, x = interval)) + 
  geom_col(col='#0072B2', fill='#E69F00')  + ylab("Exit Count") + xlab("Interval") + facet_wrap(~ day) + coord_flip()

  • The first interesting observation here is the difference in subway usage between weekdays and weekends. During weekdays, the average number of both entries and exits is significantly higher than on weekends. Since there are many travelers in Manhattan throughout the year, we expected some difference in the trend but not in the count of subway usage. However, the following plots show that there is actually a huge gap between weekday and weekend usage counts.
  • Second, the trend in subway usage changes from weekdays to weekends. As observed in the previous parts, the peak times for subway exits and entries correspond to the times people go to work and come back home. During the weekend, however, this pattern disappears: the peak time for both entries and exits is “12pm - 4pm”. This is possibly because people do not commute to work on weekends, so the pattern from commuters living outside Manhattan is less reflected in the weekend data.
  • There is another interesting observation in the exit graph. During weekdays, many people exit the subway between “4am - 8am”, but far fewer do so at that time on weekends. This is possibly because many people head to the office early in the morning on weekdays.

4.1.3 Average by station & interval

  • For those who are strangers to Manhattan and hate crowded places, we list the top 5 most crowded stations, in terms of entries and exits, that you may want to avoid.
data3 <- turnstile %>%
  select(station, station_id, entry_volume, exit_volume) %>%
  mutate(station_unique = paste(station, station_id)) %>%
  group_by(station_unique) %>%
  summarise(avg_entry = mean(entry_volume), avg_exit = mean(exit_volume))
data3 %>%
  ungroup() %>%
  arrange(avg_entry) %>%
  mutate(station_unique = reorder(station_unique, avg_entry)) %>% tail(5) %>%
  ggplot(aes(y = avg_entry, x = station_unique)) + 
  geom_col(col='#0072B2', fill="#66CC99")  + ylab("Entry Count") + xlab("Station")  + coord_flip()

data3 %>%
  ungroup() %>%
  arrange(avg_exit) %>%
  mutate(station_unique = reorder(station_unique, avg_exit)) %>% tail(5) %>%
  ggplot(aes(y = avg_exit, x = station_unique)) + 
  geom_col(col='#0072B2', fill="#E69F00")  + ylab("Exit Count") + xlab("Station")  + coord_flip()

  • The above graphs show that “Grand Central”, “Herald Square”, “Union Square”, “Port Authority” and “Times Square” are the most crowded stations in Manhattan.
  • Note that there are two Grand Central stations in the graph above. When preprocessing the subway data with longitude and latitude information, we assigned a unique station ID to entrances that were one or more blocks away from each other.
  • Even though the counts for Grand Central station are split into two parts, it still ranks as the most crowded station.

4.2 Static illustration of crime, weather (rainfall) and subway traffic data

Now that we have looked at subway human traffic data on its own, we move on to explore its relationship with crime (those committed in subway stations) and weather (rainfall).

First, we look at which subway stations are the most “dangerous” (!) and which are the safest.

4.2.1 Bar charts of crime numbers against subway stations

By looking at these bar charts displaying the top 5 stations by crime count, we see that the crime counts by station for the 3 crime types were closely related. For example, 125 ST (line 4, 5, 6) and 23 ST were in the top 5 for all 3 types of crime. The similarity between misdemeanor and violation was the strongest among the 3 possible pairings of crime type (i.e., felony-misdemeanor, felony-violation, misdemeanor-violation). In a way, this was not surprising because misdemeanor and violation are more similar to each other than to felony, which is a more serious type of crime.

Out of interest, we also looked at the 10 “safest” stations. The patterns are a little harder to infer compared with the most dangerous stations because there are many ties. Some interesting observations: Wall Street station is one of the safest, whether we look at felony or misdemeanor. And Columbia on 116th Street is also one of the safest for misdemeanor!

4.2.2 Box plots of crime numbers by time of the day

For overall crime count, the median was rather consistent across the time periods. The number of crimes committed was highest from 1600-2000hr, which coincided with the evening peak period. Based on that, we expected the morning peak (0800-1200hr) to display the next highest crime count, but the data did not support that. Instead, 1200-1600hr showed the second highest crime count based on the median. Also, variance (as indicated by the length of the box) was highest for the time periods with the highest median crime count. The patterns we saw in the overall count were similarly visible for felony and misdemeanor. For violation, the only similarity with the other crime types was that 1600-2000hr was the period with the highest crime rate.

4.2.3-1 Scatter plots of Subway Human Traffic against Crime Count (by weekend, weekday; by crime type)

One data point = one day of the year

For the above and subsequent similar scatter plots, each data point on the graph represented a single day (e.g., total human traffic across all stations for that day, total crimes committed across all stations for that day).

We created the above scatter plots to investigate if there was a relationship between human traffic and crime rate at subway stations.

We plotted the charts separately by weekday and weekend to isolate any effects that weekday vs weekend might have on the relationship between human traffic count and crime count.

Overall, there was a positive correlation between human traffic and crime count for both weekday and weekend. A similar pattern was observed for felony and misdemeanor. However, the relationship was less obvious for violation for which the sample sizes were small.

4.2.3-2 Scatter plots of Subway Human Traffic against Rainfall (by weekend, weekday)

One data point = one day of the year

The above scatter plots investigated if there was a relationship between rainfall and human traffic at subway stations.

For weekdays, we saw that human traffic was not much influenced by rainfall, which was not surprising because everyone had to go to work/school regardless of whether or not it was raining. For weekends, we saw a stronger negative relationship between rainfall and human traffic, which made sense because people might cancel their outdoor activities or leisure travelling plans depending on the weather.

We also saw that most of the data points were clustered around the y-axis, due to the fact that on most days there was no rain. On that note, we must highlight that the relationship we saw here would be very susceptible to outlier effects, i.e., the regression slope plotted on the above graph may shift significantly with another outlier, for instance a day with higher rainfall and higher traffic.

4.2.3-3 Scatter plots of Crime Count against Rainfall (by crime type)

One data point = one day of the year

The above scatter plots investigated if there was a relationship between crime rate and rainfall at subway stations.

There did not seem to be a strong relationship between overall crime count and rainfall, but we did notice that on days with heavy rain, crime count was never high, especially when we drilled down to look at misdemeanor and violation. One explanation would be that heavy rainfall might have deterred potential offenders from travelling to the subway stations.

However, we noticed that this was not true for felony. High counts of felony were observed even for days with heavy rain, suggesting that felony was less dependent on the weather.

Previously, we saw that subway human traffic was sensitive to rainfall on weekends but not so much on weekdays. Therefore, we wondered if crime count would likewise be more sensitive to rainfall on weekends. Based on visual inspection, it did not seem like weekend changed the relationship between crime rate and rainfall.

4.2.4 Scatter plots of Crime Count against Subway Human Traffic (by crime type; by weekend, weekday)

One data point = one subway station

In the previous set of scatter plots, each point represented a day, aggregating across all subway stations. For this set of scatter plots, we instead aggregated across time and let each point represent a unique subway station. The focus here was to investigate whether a subway station with higher traffic also suffered from a higher crime rate.

We noticed a general trend of higher crime counts for stations with higher human traffic, and this was true regardless of weekday or weekend, or crime type. There were two outliers with lower traffic but very high crime counts (23 ST on line 6 and 125 ST on line 4, 5 & 6), which meant that for these two stations, their higher crime rate could not be well explained by human traffic alone. Other factors could include the baseline crime rate of the surrounding neighborhood. Lower traffic could also work in reverse: a more isolated station may attract more potential offenders, since their crimes could more easily go unseen.

Lastly, the inverse relationship that we observed here between minor crime counts and rainfall may only be a proxy for the other relationships that we have previously observed (i.e., observed lower crime rate for lower traffic, lower traffic with lower rainfall).

4.2.5 Time Series of Crime Count across time; superimposed with Subway Human Traffic, Rainfall

The above set of time series was meant to explore the trend of crime across time and also to see if it varied in the same direction as the other parameters of human traffic and rainfall.

Firstly, we could see large fluctuations in crime count across time. These would not be caused by intra-day patterns because we were already looking at total crime count per day. Therefore, we looked at the crime count across time for weekdays only and then for weekends only, but the fluctuations remained. It was clear that there were other factors influencing crime count not reflected in our analysis.

Ignoring the noise, it seemed like crime rate was lower near the start and end of the year with two prominent peaks between Apr and Jun.

Then, we looked at crime count by crime type. The picture was even less clear, with even more fluctuations dominating the graph.

In an attempt to isolate the noise, we looked at the crime count for just the top 5 stations (in terms of highest crime rate). The peaks between Apr and Jun were still there but there was now a new higher peak in Nov which was not visible previously in the overall chart. The same peak could be observed when we looked at crime rate by type.

Next, we looked at human traffic and crime count across time and there seemed to be a positive correlation. This matched our earlier observation with scatter plots.

We further tried to look at rainfall with human traffic, and rainfall with crime rate, but these two graphs were dominated by large fluctuations in rainfall and were not informative. It was clear to us that scatter plots offered a better way to visualize the relationship between two variables, especially when we did not believe that time influenced both parameters (e.g., crime and rainfall) in the same way.

5. Data Pre-processing process - description of data and analysis of data quality

5.1 NYC Subway Turnstile Data

Turnstile data was collected from the following website, provided by New York State’s open data portal: https://data.ny.gov/Transportation/Turnstile-Usage-Data-2015/ug6q-shqc/data

The entire sequence of data pre-processing performed on the turnstile data was done on Python and the iPython notebook is available at: https://github.com/hw2312/fourcharsunder/blob/master/data/code/CoordsManhattanStations.ipynb https://github.com/hw2312/fourcharsunder/blob/master/data/code/TurnstileUsageManhattan.ipynb

5.1.1 Quality Issue with Turnstile Usage Data - inconsistency in “timestamp”

  • We wanted to explore the turnstile data by time of day. However, the timestamps in the original turnstile data were inconsistent, which made it difficult to analyze the trend of subway usage by interval.
  • To deal with this issue, we discretized the timestamp values into six 4-hour intervals (e.g., 12am - 4am).
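The bucketing step can be sketched as below; the function name and exact interval labels are illustrative, and the actual implementation lives in the notebook linked above:

```python
# Map a raw timestamp to one of six 4-hour interval labels.
from datetime import datetime

INTERVALS = ["12AM-04AM", "04AM-08AM", "08AM-12PM",
             "12PM-04PM", "04PM-08PM", "08PM-12AM"]

def to_interval(ts: str) -> str:
    """Bucket an 'HH:MM:SS' timestamp into a 4-hour interval label."""
    hour = datetime.strptime(ts, "%H:%M:%S").hour
    return INTERVALS[hour // 4]  # integer division picks the bucket
```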

5.1.2 Quality Issue with Turnstile Usage Data - inconsistency in “cumulative count of entry and exit”

  • After discretizing the timestamps into intervals, we differenced the cumulative entry and exit counts for each station so that we could see how many people entered or exited each station during each interval.
  • The second issue was that we were getting some negative entry/exit volumes or exorbitantly high entry/exit volumes.
  • The negative values seem to come from a couple of possibilities:
    • Some turnstiles record entries/exits in reverse chronological order.
    • Some turnstiles seem to reset their counters, which results in a negative value when the post-reset count is subtracted from the pre-reset count.
  • To deal with these issues, we took the following steps:
    • Converted all negative entry/exit volumes into positive values (assuming the reversed records are mistakes).
    • Dropped all values above the 0.9999 quantile (around 3000/4hr/device, or around 13/min/device) by replacing them with np.NaN.
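A minimal pandas sketch of these two fixes, assuming hypothetical column names (`device_id`, `timestamp`, `cum_entries`) rather than the real ones from our notebook:

```python
import pandas as pd

def clean_volumes(df: pd.DataFrame, q: float = 0.9999) -> pd.DataFrame:
    """Difference cumulative counters per device, then repair anomalies."""
    df = df.sort_values(["device_id", "timestamp"]).copy()
    df["entry_volume"] = df.groupby("device_id")["cum_entries"].diff()
    # Reverse-counting turnstiles yield negative diffs; assume a sign mistake.
    df["entry_volume"] = df["entry_volume"].abs()
    # Counter resets yield absurd spikes; null out everything above quantile q.
    cutoff = df["entry_volume"].quantile(q)
    df.loc[df["entry_volume"] > cutoff, "entry_volume"] = float("nan")
    return df
```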

5.1.3 Quality Issue with Turnstile Usage Data - inconsistency in “station name”

  • To interactively explore and analyze the turnstile data, we had to incorporate latitude and longitude into the turnstile data.
  • Our approach was to add latitude and longitude columns to the turnstile data and fill in the values by matching station name and line information from another dataset that contains latitude and longitude information.
  • However, the station names in the turnstile data had some issues:
    • Different names for different entrances of the same station (uptown / downtown).
    • Inconsistency in the format (e.g., AV vs. AVE).
  • To deal with these issues:
    • We kept the station name format consistent by applying regular expressions.
    • We plotted the stations on a map and merged stations that were the same but had different names.
    • We merged stations under the rule that an entrance was kept as a distinct station if it was located one or more blocks away from the other entrances of the same station.
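A few illustrative normalization rules of the kind we applied; the actual regular expressions in our notebook may differ:

```python
import re

def normalize_station(name: str) -> str:
    """Unify common abbreviation variants so names match across datasets."""
    name = name.upper().strip()
    name = re.sub(r"\bAVE?\b", "AV", name)     # AVE / AV -> AV
    name = re.sub(r"\bSTREET\b", "ST", name)   # STREET -> ST
    name = re.sub(r"\s+", " ", name)           # collapse repeated spaces
    return name
```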

5.2 Crime Data

Crime data was collected from the following website provided by the city of New York: https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i

The original data set comprised 5.58M rows with 24 cols. The oldest case happened on Jan 1, 1948 and the most recent case in the dataset was on Dec 31, 2016.

The entire sequence of data pre-processing performed on the crime data was done on Python and the iPython notebook is available at: https://github.com/hw2312/fourcharsunder/blob/master/crime/crime.ipynb

Below provides a summary:

5.2.1 Extracting only the relevant rows - 2015 data, Crime in Subway Stations, Manhattan Stations

As the most recent taxi data was for year 2015, we used the year 2015 for crime data to be consistent. After filtering for only 2015 data, the crime dataset was reduced to 477,031 rows.

We then filtered the data to include only the crimes committed at subway stations. This further reduced the number of rows down to 11,228.

Lastly, we filtered for only the stations in Manhattan. This reduced the number of rows down to 5,367.
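The three filters can be sketched in pandas roughly as follows. The column names (`CMPLNT_FR_DT`, `PREM_TYP_DESC`, `BORO_NM`) reflect the NYPD dataset, but the exact filter values used here are assumptions:

```python
import pandas as pd

# Toy data standing in for the 5.58M-row NYPD complaint dataset.
crimes = pd.DataFrame({
    "CMPLNT_FR_DT": ["2015-03-01", "2014-12-31", "2015-06-15"],
    "PREM_TYP_DESC": ["TRANSIT - NYC SUBWAY", "STREET", "TRANSIT - NYC SUBWAY"],
    "BORO_NM": ["MANHATTAN", "MANHATTAN", "BROOKLYN"],
})
mask = (crimes["CMPLNT_FR_DT"].str.startswith("2015")          # 1. year 2015
        & (crimes["PREM_TYP_DESC"] == "TRANSIT - NYC SUBWAY")  # 2. subway crimes
        & (crimes["BORO_NM"] == "MANHATTAN"))                  # 3. Manhattan only
subway_crimes = crimes[mask]
```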

5.2.2 Include time window and date/time conversion

As the date and time in the data set were stored as strings, we had to convert them to date and time format in order to filter the data by date and time. Regarding time, the subway human traffic data was organized in 4-hourly intervals. To facilitate analysis of the crime data alongside the subway human traffic data, we inserted an additional column into the crime dataset to indicate which 4-hourly window each crime was committed in.
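The window assignment can be sketched on toy data as below (`CMPLNT_FR_TM` is the complaint-time column in the NYPD dataset; the interval labels mirror those used for the turnstile data):

```python
import pandas as pd

INTERVALS = ["12AM-04AM", "04AM-08AM", "08AM-12PM",
             "12PM-04PM", "04PM-08PM", "08PM-12AM"]

crimes = pd.DataFrame({"CMPLNT_FR_TM": ["07:15:00", "18:40:00"]})
# Parse the string times, then bucket each hour into its 4-hour window.
hours = pd.to_datetime(crimes["CMPLNT_FR_TM"], format="%H:%M:%S").dt.hour
crimes["interval"] = [INTERVALS[h // 4] for h in hours]
```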

5.2.3 Remove rows with null lat /long

The crime dataset on its own did not indicate in which subway station the crimes were committed. Therefore, we wrote a program that assigned a crime location to the closest subway station based on distance calculated using lat/long difference.

For this purpose, it was therefore important that the data had lat/long entries. However, 170 of the filtered records had null for lat/long. We removed these records from our analysis.
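A minimal sketch of the nearest-station assignment; the station names and coordinates below are purely illustrative, and squared lat/long distance is an adequate ranking proxy at Manhattan's scale:

```python
def nearest_station(lat, lon, stations):
    """stations: iterable of (name, lat, lon); return the closest name."""
    return min(stations, key=lambda s: (s[1] - lat) ** 2 + (s[2] - lon) ** 2)[0]

# Hypothetical station list for illustration.
stations = [("Grand Central", 40.7527, -73.9772),
            ("Union Sq", 40.7356, -73.9906)]
```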

5.2.4 Drop columns

As mentioned earlier, there were 24 columns in the dataset. Many of these were not necessary for our analysis. For instance, we did not need to know the date and time at which the crime was reported; we were interested in the date/time at which the crime was committed, which was also available in the dataset. We were also not interested in the codes used for each crime type, which were for the police department’s internal reference; we wanted the type of crime expressed in language familiar to the lay person. For that, the dataset conveniently included a column indicating which of 3 categories the crime fell into: felony, misdemeanor, or violation. There were also columns giving a detailed description of the crime, but for the purpose of faceting and grouping, we felt this information was too granular and did not include it. Lastly, there were columns indicating whether the crime was committed in a park, which we did not include as these were also irrelevant for our analysis.

In the end, we kept 12 columns, the highlights of which included: crime committed date, time, lat/long, and crime category.

We added 6 columns which, as mentioned earlier, we engineered to capture in which 4-hourly window the crime was committed, plus in which subway station it occurred (based on the results of our program to estimate the subway station from lat/long proximity).

5.3 Weather Data

Weather data was collected from the following website provided by the National Centers for Environmental Information (NOAA): https://www.ncdc.noaa.gov/cdo-web/datatools/lcd

There was only one weather station available for selection in Manhattan, located at Central Park. Since Manhattan is relatively small and it is unlikely that there would be large variations in weather conditions across different parts of Manhattan, this was sufficient for the purpose of our project.

Unlike the crime data for which we had to download the entire historical set and then filter for 2015 data on our own, we were able to specify the date range within the website before we downloaded the data.

The downloaded dataset comprised 13,329 rows and 90 cols.

The weather dataset contained quite a rich set of weather-related data, such as humidity, average wind speed, monthly minimum temperature, hourly sky conditions. For this dataset, we decided to focus only on rainfall as the weather condition to use when analyzing weather’s relationship with crime and subway human traffic data.

The entire sequence of data pre-processing performed on the weather data was done on Python and the iPython notebook is available at: https://github.com/hw2312/fourcharsunder/blob/master/weather/weather_new.ipynb

Below provides a summary:

5.3.1 Include time window and date/time conversion

Similar to what we did for crime data, we converted the date and time from string type to date and time format. Likewise, we inserted an additional column in this weather dataset to indicate which 4 hourly window the weather recording was made in.

5.3.2 Drop columns

As mentioned earlier, there were 90 columns in the dataset, many of which were not related to rainfall and not necessary for our analysis. Eventually, this dataset was transformed into one with only 3 columns, all of which were engineered features. We explain the feature engineering below.

5.3.3 Clean up rainfall readings

There were several columns representing rainfall, but we focused only on the one giving hourly rainfall (i.e., the highest time resolution). For hourly rainfall, there were readings with ‘T’ for value instead of a numerical reading. There were 1,991 such readings. ‘T’ represented ‘trace’ meaning a very small amount of rainfall which were too small for the instrument to provide an accurate reading. For these readings, we replaced them with 0.

There were also 1,761 hourly rainfall readings with a null value. Likewise, we replaced them with 0.

There were hourly readings appended with the character ‘s’ to indicate that these were estimated readings. For the purpose of our project, we did not have to differentiate between estimated and actual measured rainfall. Therefore, we removed this information.

Lastly, the hourly rainfall readings were recorded as strings, which we could not use for numerical operations. We converted them to numeric values.
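Putting the three fixes together, a sketch of the per-reading cleanup (nulls are shown as None for simplicity; the actual column handling in our notebook may differ):

```python
def clean_precip(value) -> float:
    """Turn a raw hourly-precipitation reading into a float (inches)."""
    # 'T' (trace) and nulls both become zero rainfall.
    if value is None or value == "" or value == "T":
        return 0.0
    # A trailing 's' marks an estimated reading; we drop the flag.
    return float(str(value).rstrip("s"))
```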

The graphic below provides a quick visual snapshot of the original dataset in terms of missing values.

library(extracat)
raw_weather <- read.csv("weather/weather_data_raw.csv")
visna(raw_weather, sort = "b")

5.3.4 Aggregate rainfall readings

Consistent with the rest of the project, we were interested in data at a 4-hourly window resolution. The hourly rainfall data had to be converted to the same format to facilitate our analysis. Therefore, we aggregated the rainfall readings within each 4-hourly window to get the total rainfall measured within that window.

As mentioned earlier, we transformed this weather dataset into 3 columns: date, 4-hourly time interval, and total rainfall (within that interval).
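The aggregation can be sketched on toy data as:

```python
import pandas as pd

INTERVALS = ["12AM-04AM", "04AM-08AM", "08AM-12PM",
             "12PM-04PM", "04PM-08PM", "08PM-12AM"]

# Toy hourly readings: two fall in the 12PM-4PM window, one in 4PM-8PM.
hourly = pd.DataFrame({
    "date": ["2015-02-01"] * 3,
    "hour": [13, 14, 17],
    "rain": [0.1, 0.2, 0.3],
})
hourly["interval"] = [INTERVALS[h // 4] for h in hourly["hour"]]
# Sum the hourly readings into their 4-hourly window.
by_window = hourly.groupby(["date", "interval"], as_index=False)["rain"].sum()
```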

5.4 Taxi Data

Taxi data was collected from the following website provided by the city of New York: https://data.cityofnewyork.us/Transportation/2015-Yellow-Taxi-Trip-Data/ba8s-jw6u

2015 Yellow Taxi Trip data includes all taxi trips completed in yellow taxis in New York from January to June 2015. This dataset is collected by technology providers authorized under the Taxicab Passenger Enhancement Program and provided to the NYC Taxi and Limousine Commission. The recorded columns include pick-up and drop-off dates and locations, trip distances, payment methods, fares, etc. There are a total of 77.1 million rows. Due to the large size of the dataset, we decided to work with trips completed in February, because February has the highest average daily trip count, at approximately 443,000 trips per day.

Only 1.7% of the dataset has missing pick-up and drop-off locations. And since this dataset contains all taxi trips rather than a sample, it guards against sampling bias. We understand that February data won’t represent the entire dataset, but for simplicity, we excluded seasonality from our analysis.

5.4.1 Extracting only the relevant columns

The purpose of using 2015 Yellow Trip data was to map pick-up and drop-off locations to the nearest subway stations in Manhattan and visualize behavioral patterns between subway riders and taxi riders. Therefore, the columns we kept were pick-up and drop-off dates and locations.

5.4.2 Remove rows with missing data

There were rows with missing pick-up/drop-off latitude and longitude entries. These made up only 1.7% of the entire dataset. Given the large size of the data, we simply removed these rows.

5.4.3 Cross Join with a file that contains Subway Station Coordinates

To find the closest subway station to each taxi pick-up/drop-off location, we calculated squared distances from the pick-up/drop-off locations to all subway stations. We made the simple assumption that people could have taken the subway instead of a taxi by using the closest subway station. We then took the minimum distance for each taxi ride and mapped pick-up/drop-off locations to the corresponding subway stations.

5.4.4 Group By Station ID, Date, Hour and aggregate row counts

We transformed the taxi data into the same structure as the turnstile data by calculating row counts for each (Station ID, Date, Hour) tuple. These preprocessing steps were done in Scala and Spark. As a result, we could compare taxi riders with subway riders across different times of day and days of the week.
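Although our pipeline ran in Scala and Spark, the same two steps (nearest-station mapping, then counting per station/date/hour) can be sketched in pandas on toy data; station names and coordinates here are illustrative:

```python
import pandas as pd

# Hypothetical station list.
stations = [("Grand Central", 40.7527, -73.9772),
            ("Union Sq", 40.7356, -73.9906)]

def nearest(lat, lon):
    """Rank stations by squared lat/long distance; return the closest name."""
    return min(stations, key=lambda s: (s[1] - lat) ** 2 + (s[2] - lon) ** 2)[0]

trips = pd.DataFrame({
    "pickup_lat": [40.7520, 40.7360, 40.7530],
    "pickup_lon": [-73.9770, -73.9900, -73.9775],
    "date": ["2015-02-01"] * 3,
    "hour": [8, 8, 8],
})
# Step 1: map each trip to its nearest station.
trips["station"] = [nearest(la, lo) for la, lo in
                    zip(trips["pickup_lat"], trips["pickup_lon"])]
# Step 2: count trips per (station, date, hour), matching the turnstile layout.
counts = trips.groupby(["station", "date", "hour"]).size().reset_index(name="trips")
```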

6. Conclusion

We started this project by asking ourselves which topic was closest to our hearts and what we were most curious about regarding that topic. As it turned out, the NYC subway interested all of us immensely, it being the city’s bloodline and an integral part of our daily lives. And so we started this project with certain answers in mind for our questions, but at the same time not entirely certain about what we would find, as we embarked on an exploratory data analysis journey through the NYC subway and related datasets.

There were many findings that largely matched our collective and conventional wisdom. We reiterate a few highlights here: The evening peak period (1600-2000hr) experienced the highest subway usage (based on turnstile data). Weekday subway usage was higher than weekend usage. Rainfall intensity was inversely related to subway traffic, though less so on weekdays. The volume of crime committed at subway stations was positively correlated with subway usage, regardless of whether we looked at it aggregated across time for each unique station or aggregated across stations for each day.

At the same time, we were pleasantly surprised by other findings that revealed patterns/relationships we were previously unaware of or that ran contrary to our expectations. For most of these findings, we were able to find very reasonable explanations, and where we did, we explained them in this report. Examples worth mentioning again: Unlike the evening peak, subway station entries in Manhattan during the morning peak period were not high. (As mentioned in the report, we eventually linked this to the fact that most people working in Manhattan do not live in Manhattan; they would have entered the subway system outside Manhattan and then exited within it.) Even though weekend subway human traffic was lower than on weekdays, it was not substantially so. (We linked this to the fact that Manhattan is a global tourist spot, so there would still be many visitors from all over the world using its subway system on weekends.) We also found that crime rate was very slightly inversely related to rainfall, but only for minor crimes (misdemeanor / violation); for felony, there were days with heavy rain and very high crime counts. One last finding to mention: there were two outlier stations (23 ST, line 6, and 125 ST, line 4, 5 & 6) with very high crime rates even though the human traffic they experienced was not that high. We took it that there were other factors affecting crime rate that were not captured in our report. From this point, we would like to mention a few other learning points (some of which were not new to us but served as reinforcement) that we took from this project.

Firstly, correlation does not imply causation. The patterns that we found here only suggested, but did not confirm, cause-and-effect relationships. There could be other factors at play, or the pattern that we saw may only be a proxy for a deeper direct relationship.

Secondly, data pre-processing is an extremely important and time-consuming, though unseen, part of any data science project. As the saying goes: garbage in, garbage out. In our case, we took pains to ensure that the data we used for exploration was properly cleaned. A simple example would be converting data to the proper types and removing/imputing null values. A more subtle and more challenging example was the turnstile data, some of which showed unusually high or low counts. We uncovered such problems as we visualized the data and took the range of each data column during preliminary analysis. (We attributed the turnstile count anomaly to the counter resetting itself when it reached the max, or to a counter counting in the wrong direction.)

Thirdly, data exploration is not a linear process. And there are endless ways to combine data together and visualize them in different ways using different tools. There was never a single way that worked best, though certain basic data visualization principles applied (e.g., avoid using shape or color if possible). More often than not, the more combinations we tried, the more insights we got and the better we understood our data. For example, when we looked at crime count as a time-series, we noticed large fluctuations. We suspected that it was due to the weekday vs weekend effect, or due to opposing/magnifying effects from different crime types. However, after isolating for these effects, the fluctuations remained. We concluded that there were other influencing factors on crime count that were not analyzed in this project, or simply that there was a strong element of randomness to crime count that could not be explained by any factor.

And lastly, we learned to write down what we expected to see from our data but keep an open mind regarding the results. Results that did not match our expectations turned out to be just as if not more rewarding than those that did, mainly because we learned something new from the data.

This project represents the culmination of what we learned from the Exploratory Data Analysis and Visualization class; it is a fitting end to what has been a very exciting and rewarding journey for the four of us.